Loss curves are a standard actuarial technique that helps insurance companies assess the reserve capital they need to keep on hand to cover claims from a line of business. Claims made and reported in a given accounting period are tracked separately over time, so that more recent periods with little claim development can use the behaviour of policies from earlier periods to help predict what total claims will be.
Total claim amounts for a single accounting period are usually laid out in a single row, with each column showing the total claim amount after that period of development. Subsequent accounting periods have less development, so the data takes a triangular shape, hence the term ‘loss triangles’. Using previous patterns, the known upper part of the triangle can be used to predict values in the unknown lower triangle.
The ChainLadder package provides functionality to generate and use these loss triangles.
In this case study, we take a related but different approach: we model the growth of the losses in each accounting period as an increasing function of time, and use the model to estimate the parameters which determine the shape and form of this growth. We also use the sampler to estimate the values of the “ultimate loss ratio”, i.e. the ratio of the total claims on an accounting period to the total premium received to write those policies.
We will work with two different functional forms for the growth behaviour of the loss curves: a ‘Weibull’ model and a ‘log-logistic’ model:
\[ g(t \, ; \; \theta, \omega) = 1 - \exp\left(-\left(\frac{t}{\theta}\right)^\omega\right) \;\; (\text{Weibull}) \]
\[ g(t \, ; \; \theta, \omega) = \frac{t^\omega}{t^\omega + \theta^\omega} \;\; (\text{Log-logistic}) \]
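Before fitting anything, it is worth confirming in R that both growth functions behave as cumulative development patterns, i.e. they increase with \(t\) and approach 1. The parameter values below are arbitrary illustrations, not estimates:

```r
# Weibull growth curve: g(t) = 1 - exp(-(t / theta)^omega)
growth_weibull <- function(t, omega, theta) {
  1 - exp(-(t / theta)^omega)
}

# Log-logistic growth curve: g(t) = t^omega / (t^omega + theta^omega)
growth_loglogistic <- function(t, omega, theta) {
  t^omega / (t^omega + theta^omega)
}

t_seq <- 1:10

# Both curves should be increasing in t and bounded in (0, 1)
round(growth_weibull(t_seq, omega = 1.5, theta = 2.2), 3)
round(growth_loglogistic(t_seq, omega = 1.5, theta = 2.2), 3)
```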
We will first look at the car insurance loss data from casact.org.
### File was downloaded from http://www.casact.org/research/reserve_data/ppauto_pos.csv
ppauto_dt <- fread("data/ppauto_pos.csv")
setnames(ppauto_dt, tolower(names(ppauto_dt)))
head(ppauto_dt)
## grcode grname accidentyear developmentyear developmentlag
## 1: 43 IDS Property Cas Ins Co 1988 1988 1
## 2: 43 IDS Property Cas Ins Co 1988 1989 2
## 3: 43 IDS Property Cas Ins Co 1988 1990 3
## 4: 43 IDS Property Cas Ins Co 1988 1991 4
## 5: 43 IDS Property Cas Ins Co 1988 1992 5
## 6: 43 IDS Property Cas Ins Co 1988 1993 6
## incurloss_b cumpaidloss_b bulkloss_b earnedpremdir_b earnedpremceded_b
## 1: 607 133 226 957 62
## 2: 647 333 129 957 62
## 3: 582 431 38 957 62
## 4: 598 570 19 957 62
## 5: 614 615 0 957 62
## 6: 615 615 0 957 62
## earnedpremnet_b single postedreserve97_b
## 1: 895 0 73044
## 2: 895 0 73044
## 3: 895 0 73044
## 4: 895 0 73044
## 5: 895 0 73044
## 6: 895 0 73044
For modelling purposes, we want to ensure the data we work with is a snapshot in time, so we restrict it to development years before 1998 and to four chosen carriers.
use_grcode <- c(43,353,388,620)
carrier_snapshot_dt <- ppauto_dt %>%
filter(grcode %in% use_grcode
,developmentyear < 1998) %>%
mutate(acc_year = accidentyear
,dev_lag = developmentlag
,premium = earnedpremdir_b
,cum_loss = cumpaidloss_b
,loss_ratio = cum_loss / premium) %>%
select(grcode, acc_year, dev_lag, premium, cum_loss, loss_ratio)
We are looking at four insurers with the GRCODEs above. Before we proceed with any analysis, we first plot the data, grouping the loss curves by accounting year and faceting by carrier.
ggplot(carrier_snapshot_dt) +
geom_line(aes(x = dev_lag, y = loss_ratio, colour = as.character(acc_year))
,size = 0.3) +
expand_limits(y = c(0,1)) +
facet_wrap(~grcode) +
xlab('Development Time') +
ylab('Loss Ratio') +
ggtitle('Snapshot of Loss Curves for 10 Years of\nPrivate Passenger Auto Insurance for Single Organisation') +
guides(colour = guide_legend(title = 'Cohort Year'))
Next we look at the chain-ladder layout of the data; rather than the loss ratios, we first look at the dollar amounts of the losses.
snapshot_dt <- carrier_snapshot_dt %>%
filter(grcode %in% use_grcode[1])
snapshot_dt %>%
select(acc_year, dev_lag, premium, cum_loss) %>%
spread(dev_lag, cum_loss) %>%
print
## acc_year premium 1 2 3 4 5 6 7 8 9 10
## 1: 1988 957 133 333 431 570 615 615 615 614 614 614
## 2: 1989 3695 934 1746 2365 2579 2763 2966 2940 2978 2978 NA
## 3: 1990 6138 2030 4864 6880 8087 8595 8743 8763 8762 NA NA
## 4: 1991 17533 4537 11527 15123 16656 17321 18076 18308 NA NA NA
## 5: 1992 29341 7564 16061 22465 25204 26517 27124 NA NA NA NA
## 6: 1993 37194 8343 19900 26732 30079 31249 NA NA NA NA NA
## 7: 1994 46095 12565 26922 33867 38338 NA NA NA NA NA NA
## 8: 1995 51512 13437 26012 31677 NA NA NA NA NA NA NA
## 9: 1996 52481 12604 23446 NA NA NA NA NA NA NA NA
## 10: 1997 56978 12292 NA NA NA NA NA NA NA NA NA
We also lay out the loss ratios in a similar fashion.
snapshot_dt %>%
select(acc_year, dev_lag, premium, loss_ratio) %>%
spread(dev_lag, loss_ratio) %>%
print(digits = 2)
## acc_year premium 1 2 3 4 5 6 7 8 9 10
## 1: 1988 957 0.14 0.35 0.45 0.60 0.64 0.64 0.64 0.64 0.64 0.64
## 2: 1989 3695 0.25 0.47 0.64 0.70 0.75 0.80 0.80 0.81 0.81 NA
## 3: 1990 6138 0.33 0.79 1.12 1.32 1.40 1.42 1.43 1.43 NA NA
## 4: 1991 17533 0.26 0.66 0.86 0.95 0.99 1.03 1.04 NA NA NA
## 5: 1992 29341 0.26 0.55 0.77 0.86 0.90 0.92 NA NA NA NA
## 6: 1993 37194 0.22 0.54 0.72 0.81 0.84 NA NA NA NA NA
## 7: 1994 46095 0.27 0.58 0.73 0.83 NA NA NA NA NA NA
## 8: 1995 51512 0.26 0.50 0.61 NA NA NA NA NA NA NA
## 9: 1996 52481 0.24 0.45 NA NA NA NA NA NA NA NA
## 10: 1997 56978 0.22 NA NA NA NA NA NA NA NA NA
Since we are working with the loss ratio, we recreate this view as a plot, showing loss ratios instead of dollar losses.
ggplot() +
geom_line(aes(x = dev_lag, y = loss_ratio, colour = as.character(acc_year))
,data = snapshot_dt
,size = 0.3) +
expand_limits(y = 0) +
xlab('Development Time') +
ylab('Loss Ratio') +
guides(colour = guide_legend(title = 'Cohort Year'))
We want to use only the data at a given snapshot, so we choose all data current to 1998. Thus we have 10 years of development for our first 1988 cohort, and one year less for each subsequent cohort. Our final cohort, 1997, has only a single year of development.
cohort_dt <- snapshot_dt %>%
select(acc_year, premium) %>%
unique() %>%
mutate(cohort_id = 1:n())
usedata_dt <- snapshot_dt
cohort_maxtime <- usedata_dt %>%
group_by(acc_year) %>%
summarise(maxtime = max(dev_lag)) %>%
arrange(acc_year) %>%
.[[2]]
cohort_premium <- usedata_dt %>%
group_by(acc_year) %>%
summarise(premium = unique(premium)) %>%
.[[2]]
t_values <- usedata_dt %>%
select(dev_lag) %>%
arrange(dev_lag) %>%
unique %>%
.[[1]]
standata_lst <- list(growthmodel_id = 1 # Use weibull rather than loglogistic
,n_data = usedata_dt %>% nrow
,n_time = usedata_dt %>% select(dev_lag) %>% unique %>% nrow
,n_cohort = usedata_dt %>% select(acc_year) %>% unique %>% nrow
,cohort_id = get_character_index(usedata_dt$acc_year)
,cohort_maxtime = cohort_maxtime
,t_value = t_values
,t_idx = get_character_index(usedata_dt$dev_lag)
,premium = cohort_premium
,loss = usedata_dt$cum_loss
)
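The helper `get_character_index()` is used above but not defined in this document. A minimal reconstruction consistent with its usage here (mapping each element to the position of its value among the sorted unique values, yielding 1-based integer indices for Stan) might be:

```r
# Hypothetical reconstruction of get_character_index(), which is not shown in
# this document. Each element of x is mapped to the position of its value
# among the sorted unique values, giving a 1-based integer index for Stan.
get_character_index <- function(x) {
  match(x, sort(unique(x)))
}

get_character_index(c(1990, 1988, 1990, 1989))  # 3 1 3 2
```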
The full Stan file is shown below:
functions {
real growth_factor_weibull(real t, real omega, real theta) {
real factor;
factor = 1 - exp(-(t/theta)^omega);
return(factor);
}
real growth_factor_loglogistic(real t, real omega, real theta) {
real factor;
factor = ((t^omega) / (t^omega + theta^omega));
return(factor);
}
}
data {
int<lower=0,upper=1> growthmodel_id;
int n_data;
int n_time;
int n_cohort;
int cohort_id[n_data];
int t_idx[n_data];
int cohort_maxtime[n_cohort];
real<lower=0> t_value[n_time];
real premium[n_cohort];
real loss[n_data];
}
parameters {
real<lower=0> omega;
real<lower=0> theta;
real<lower=0> LR[n_cohort];
real mu_LR;
real<lower=0> sd_LR;
real<lower=0> loss_sd;
}
transformed parameters {
real gf[n_time];
real loss_mean[n_cohort, n_time];
for(i in 1:n_time) {
if(growthmodel_id == 1) {
gf[i] = growth_factor_weibull (t_value[i], omega, theta);
} else {
gf[i] = growth_factor_loglogistic(t_value[i], omega, theta);
}
}
for(i in 1:n_data) {
loss_mean[cohort_id[i], t_idx[i]] = LR[cohort_id[i]] * premium[cohort_id[i]] * gf[t_idx[i]];
}
}
model {
mu_LR ~ normal(0, 0.5);
sd_LR ~ lognormal(0, 0.5);
LR ~ lognormal(mu_LR, sd_LR);
loss_sd ~ lognormal(0, 0.7);
omega ~ lognormal(0, 1);
theta ~ lognormal(0, 1);
for(i in 1:n_data) {
loss[i] ~ normal(loss_mean[cohort_id[i], t_idx[i]], premium[cohort_id[i]] * loss_sd);
}
}
generated quantities {
real mu_LR_exp;
real<lower=0> loss_sample[n_cohort, n_time];
real<lower=0> loss_prediction[n_cohort, n_time];
real<lower=0> step_ratio[n_cohort, n_time];
for(i in 1:n_cohort) {
for(j in 1:n_time) {
loss_sample[i, j] = LR[i] * premium[i] * gf[t_idx[j]];
step_ratio [i, j] = 1.0;
}
}
mu_LR_exp = exp(mu_LR);
for(i in 1:n_data) {
loss_prediction[cohort_id[i], t_idx[i]] = loss[i];
}
for(i in 1:n_cohort) {
for(j in 2:n_time) {
step_ratio[i, j] = gf[t_idx[j]] / gf[t_idx[j-1]];
}
}
for(i in 1:n_cohort) {
for(j in (cohort_maxtime[i]+1):n_time) {
loss_prediction[i,j] = loss_prediction[i,j-1] * step_ratio[i,j];
}
}
}
There are a few points worth noting about the above model file: a single growth curve (omega, theta) is shared across all cohorts, each cohort has its own loss ratio LR drawn from a common lognormal distribution, and the observation noise is scaled by the cohort premium. With that in mind, we fit the model:
lc_1_stanfit <- stan(file = 'losscurves_single.stan'
,data = standata_lst
,iter = 500
,chains = 8
)
The Stan sampler reports no divergent transitions, so that is a good start.
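The absence of divergences can also be confirmed programmatically. A small helper (not part of the original code; it assumes the list of per-chain matrices returned by `rstan::get_sampler_params(fit, inc_warmup = FALSE)`) might look like:

```r
# Hypothetical helper to total divergent transitions across chains. It expects
# the list of per-chain matrices produced by
# rstan::get_sampler_params(fit, inc_warmup = FALSE), each of which has a
# "divergent__" column.
count_divergences <- function(sampler_params) {
  sum(vapply(sampler_params,
             function(m) sum(m[, "divergent__"]),
             numeric(1)))
}
```

For the fit above we would call `count_divergences(get_sampler_params(lc_1_stanfit, inc_warmup = FALSE))` and expect it to return 0.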
It is always worth checking convergence of the model by inspecting the \(\hat{R}\) statistics and ensuring they are less than about 1.1.
# Plot of convergence statistics
lc_1_draws <- extract(lc_1_stanfit, permuted = FALSE, inc_warmup = TRUE)
lc_1_monitor_tbl <- as.data.frame(monitor(lc_1_draws, print = FALSE))
lc_1_monitor_tbl <- lc_1_monitor_tbl %>%
mutate(variable = rownames(lc_1_monitor_tbl)
,parameter = gsub("\\[.*]", "", variable)
)
ggplot(lc_1_monitor_tbl) +
aes(x = parameter, y = Rhat, color = parameter) +
geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
geom_hline(aes(yintercept = 1), size = 0.5) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
ylab(expression(hat(italic(R))))
## Warning: Removed 110 rows containing missing values (geom_point).
The \(\hat{R}\) values for the parameters all appear to be in or around 1, so that is encouraging.
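Rather than relying on the plot alone, a small helper can flag any parameters exceeding the usual 1.1 threshold. This is an illustrative addition that assumes a data frame with `variable` and `Rhat` columns, such as the one built from `monitor()` above:

```r
# Flag parameters whose Rhat exceeds a threshold. Assumes a data frame with
# columns `variable` and `Rhat`, as constructed from monitor() output.
flag_high_rhat <- function(monitor_df, threshold = 1.1) {
  bad <- !is.na(monitor_df$Rhat) & monitor_df$Rhat > threshold
  monitor_df$variable[bad]
}
```

For a well-behaved fit, `flag_high_rhat(lc_1_monitor_tbl)` should return an empty vector.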
Finally, we also look at the effective sample sizes (n_eff).
ggplot(lc_1_monitor_tbl) +
aes(x = parameter, y = n_eff, color = parameter) +
geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
expand_limits(y = 0) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
xlab("Parameter") +
ylab(paste0("Effective Sample Count (n_eff)"))
param_root <- c("omega", "theta", "LR", "gf", "loss_sd")
use_vars <- lc_1_monitor_tbl %>%
filter(parameter %in% param_root) %>%
.[["variable"]]
We then check the traces. There are a large number of parameters in this model fit, so we break them into groups. First we look at omega, theta and LR.
rstan::traceplot(lc_1_stanfit, pars = c("omega", "theta", "LR")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Now we look at the traces for gf and loss_sd.
rstan::traceplot(lc_1_stanfit, pars = c("gf", "loss_sd")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Now we check the 50% credible intervals of the parameters.
plotdata_dt <- lc_1_monitor_tbl %>%
filter(variable %in% use_vars) %>%
select(mean, `25%`, `50%`, `75%`) %>%
mutate(variable = factor(use_vars, levels = use_vars))
ggplot(plotdata_dt) +
geom_point(aes(x = variable, y = mean)) +
geom_errorbar(aes(x = variable, ymin = `25%`, ymax = `75%`), width = 0) +
expand_limits(y = 0) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
xlab("Parameter") +
ylab("Value")
Now that we have our fit, we can start looking at some plots. First we run some simple sanity checks: we look at the full development of an accounting year and see how well the fit tracks it.
fitted_curves_tbl <- extract(lc_1_stanfit)$loss_sample[,1,] %>%
as_data_frame() %>%
mutate(iter = 1:n()) %>%
gather("timelbl", "value", -iter) %>%
mutate(time = gsub("V", "", timelbl) %>% as.numeric())
ggplot(snapshot_dt %>% filter(acc_year == 1988)) +
geom_line (aes(x = time, y = value, group = iter)
,data = fitted_curves_tbl, alpha = 0.01) +
geom_line (aes(x = dev_lag, y = cum_loss), colour = 'red') +
geom_point(aes(x = dev_lag, y = cum_loss), colour = 'blue') +
expand_limits(y = 0) +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Plot of 1988 Year Loss Development Against Posterior Distribution") +
xlab("Time") +
ylab("Loss")
The model seems to be doing a reasonable job, so now we use it to predict forward. We start with 1993, where we have five years of development.
predict_cone_tbl <- extract(lc_1_stanfit)$loss_prediction[,6,] %>%
as_data_frame() %>%
mutate(iter = 1:n()) %>%
gather("timelbl", "value", -iter) %>%
mutate(time = gsub("V", "", timelbl) %>% as.numeric())
plot_predict <- ggplot(ppauto_dt %>% filter(grcode == 43, accidentyear == 1993)) +
geom_line (aes(x = time, y = value, group = iter)
,data = predict_cone_tbl, alpha = 0.01) +
geom_line (aes(x = developmentlag, y = cumpaidloss_b), colour = 'red') +
geom_point(aes(x = developmentlag, y = cumpaidloss_b), colour = 'blue') +
expand_limits(y = 0) +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Plot of 1993 Year Loss Prediction") +
xlab("Time") +
ylab("Loss")
plot(plot_predict)
We now look at a later year, with less claim development.
predict_cone_tbl <- extract(lc_1_stanfit)$loss_prediction[,8,] %>%
as_data_frame() %>%
mutate(iter = 1:n()) %>%
gather("timelbl", "value", -iter) %>%
mutate(time = gsub("V", "", timelbl) %>% as.numeric())
plot_predict <- ggplot(ppauto_dt %>% filter(grcode == 43, accidentyear == 1995)) +
geom_line (aes(x = time, y = value, group = iter)
,data = predict_cone_tbl, alpha = 0.01) +
geom_line (aes(x = developmentlag, y = cumpaidloss_b), colour = 'red') +
geom_point(aes(x = developmentlag, y = cumpaidloss_b), colour = 'blue') +
expand_limits(y = 0) +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Plot of 1995 Year Loss Prediction") +
xlab("Time") +
ylab("Loss")
plot(plot_predict)
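These prediction cones can be reduced to point estimates of the outstanding reserve. A hedged sketch: given posterior draws of a cohort's ultimate loss (e.g. a slice such as `extract(lc_1_stanfit)$loss_prediction[, cohort, n_time]`, an assumption about how you would index that array) and the paid-to-date amount, we can summarise the implied reserve:

```r
# Summarise posterior draws of a cohort's ultimate loss into a reserve
# estimate, where reserve = ultimate loss - paid to date. `ultimate_draws`
# is a numeric vector of posterior draws; `paid_to_date` is a scalar.
summarise_reserve <- function(ultimate_draws, paid_to_date,
                              probs = c(0.1, 0.5, 0.9)) {
  reserve_draws <- ultimate_draws - paid_to_date
  c(mean = mean(reserve_draws), quantile(reserve_draws, probs = probs))
}

# Illustrative draws only, not model output
summarise_reserve(c(100, 110, 120, 130), paid_to_date = 80)
```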
We now extend the model to multiple carriers, keeping only carriers with strictly positive premiums and selecting the first 15 of them.
stanfile <- 'losscurves_multiple.stan'
grcodes <- ppauto_dt[, .(pos_prem = all(earnedpremdir_b > 0)), by = grcode][pos_prem == TRUE, grcode]
use_grcode <- grcodes[1:15]
multi_dt <- ppauto_dt[developmentyear < 1998 &
grcode %in% use_grcode, .(grcode = grcode
,accyear = accidentyear
,premium = earnedpremdir_b
,devlag = developmentlag
,cumloss = cumpaidloss_b
,lossratio = cumpaidloss_b / earnedpremdir_b
)]
cohort_dt <- multi_dt[, .(maxtime = max(devlag), premium = unique(premium)), by = .(grcode, accyear)]
cohort_dt[, cohort_id := .I]
## grcode accyear maxtime premium cohort_id
## 1: 43 1988 10 957 1
## 2: 43 1989 9 3695 2
## 3: 43 1990 8 6138 3
## 4: 43 1991 7 17533 4
## 5: 43 1992 6 29341 5
## ---
## 146: 1767 1993 5 12533746 146
## 147: 1767 1994 4 13590797 147
## 148: 1767 1995 3 14401255 148
## 149: 1767 1996 2 14900682 149
## 150: 1767 1997 1 15065713 150
lst_standata <- list(growthmodel_id = 1 # Use weibull
,n_data = multi_dt[, .N]
,n_time = multi_dt[, length(unique(devlag))]
,n_cohort = cohort_dt[, length(unique(accyear))]
,n_org = cohort_dt[, length(unique(grcode))]
,n_cohortdata = cohort_dt[, .N]
,cohort_id = match(multi_dt$accyear, unique(cohort_dt$accyear))
,org_id = match(multi_dt$grcode, unique(cohort_dt$grcode))
,t_value = multi_dt[, sort(unique(devlag))]
,t_idx = multi_dt[, match(devlag, sort(unique(devlag)))]
,premium = multi_dt$premium
,loss = multi_dt$cumloss
,cohort_maxtime = cohort_dt$maxtime
)
mi_2_stanmodel <- stan_model(stanfile, verbose = TRUE)
##
## TRANSLATING MODEL 'losscurves_multiple' FROM Stan CODE TO C++ CODE NOW.
## successful in parsing the Stan model 'losscurves_multiple'.
mi_2_stanvb <- vb(mi_2_stanmodel
,data = lst_standata
)
##
## This is Automatic Differentiation Variational Inference.
##
## (EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
##
## Gradient evaluation took 0.000356 seconds
## 1000 iterations under these settings should take 0.356 seconds.
## Adjust your expectations accordingly!
##
## Begin eta adaptation.
## Iteration: 1 / 250 [ 0%] (Adaptation)
## Iteration: 50 / 250 [ 20%] (Adaptation)
## Iteration: 100 / 250 [ 40%] (Adaptation)
## Iteration: 150 / 250 [ 60%] (Adaptation)
## Iteration: 200 / 250 [ 80%] (Adaptation)
## Iteration: 250 / 250 [100%] (Adaptation)
## Success! Found best value [eta = 0.1].
##
## Begin stochastic gradient ascent.
## iter ELBO delta_ELBO_mean delta_ELBO_med notes
## 100 -2e+04 1.000 1.000
## 200 -1e+04 0.778 1.000
## 300 -1e+04 0.568 0.555
## 400 -1e+04 0.444 0.555
## 500 -1e+04 0.360 0.149
## 600 -9e+03 0.305 0.149
## 700 -9e+03 0.264 0.069
## 800 -9e+03 0.233 0.069
## 900 -9e+03 0.209 0.028
## 1000 -9e+03 0.189 0.028
## 1100 -9e+03 0.090 0.028
## 1200 -9e+03 0.036 0.020
## 1300 -9e+03 0.022 0.016
## 1400 -9e+03 0.015 0.014
## 1500 -8e+03 0.013 0.011
## 1600 -8e+03 0.011 0.011
## 1700 -8e+03 0.010 0.010
## 1800 -8e+03 0.009 0.009 MEAN ELBO CONVERGED MEDIAN ELBO CONVERGED
##
## Drawing a sample of size 1000 from the approximate posterior...
## COMPLETED.
mi_2_stanfit <- sampling(mi_2_stanmodel
,data = lst_standata
,iter = 500
,chains = 8
,verbose = TRUE
)
##
## CHECKING DATA AND PREPROCESSING FOR MODEL 'losscurves_multiple' NOW.
##
## COMPILING MODEL 'losscurves_multiple' NOW.
##
## STARTING SAMPLER FOR MODEL 'losscurves_multiple' NOW.
mi_2_draws <- extract(mi_2_stanfit, permuted = FALSE, inc_warmup = TRUE)
mi_2_monitor <- as.data.frame(monitor(mi_2_draws
,probs = c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)
,print = FALSE))
mi_2_monitor$Parameter <- as.factor(gsub("\\[.*]", "", rownames(mi_2_monitor)))
plotdata_dt <- mi_2_monitor[, c('mean', '10%', '50%', '90%')]
setDT(plotdata_dt)
plotdata_dt[, variable := factor(rownames(mi_2_monitor)
,levels = rownames(mi_2_monitor))]
## mean 10% 50% 90% variable
## 1: 1.44414 1.39334 1.44251 1.49421 omega[1]
## 2: 1.38916 1.29145 1.38680 1.49091 omega[2]
## 3: 1.25634 1.20762 1.25482 1.30623 omega[3]
## 4: 1.29241 1.23350 1.29031 1.35451 omega[4]
## 5: 1.32006 1.25360 1.31828 1.38868 omega[5]
## ---
## 1919: 4.00000 4.00000 4.00000 4.00000 t_098[12]
## 1920: 5.00050 5.00000 5.00000 5.00000 t_098[13]
## 1921: 3.18100 3.00000 3.00000 4.00000 t_098[14]
## 1922: 4.98300 5.00000 5.00000 5.00000 t_098[15]
## 1923: -5460.77977 -5477.42885 -5460.37319 -5445.21673 lp__
ggplot(plotdata_dt[grep("hyper|sd_LR|_exp", variable)]) +
geom_point(aes(x = variable, y = mean)) +
geom_errorbar(aes(x = variable, ymin = `10%`, ymax = `90%`), width = 0) +
expand_limits(y = 0) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
xlab("Parameter") +
ylab("Value")
We also want to look at the expected loss ratio for each of the carriers by itself.
ggplot(plotdata_dt[grep("mu_LR_exp", variable)]) +
geom_point(aes(x = variable, y = mean)) +
geom_errorbar(aes(x = variable, ymin = `10%`, ymax = `90%`), width = 0) +
expand_limits(y = 0) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
xlab("Parameter") +
ylab("Value")